Abstract

The focus of this study was on the visual analysis of time series 
data for evaluating educational programs. Two characteristics of the 
data (changes in slope and variability) and two characteristics of
evaluation (training in data utilization and the use of
aimlines/decision rules) were manipulated. A total of 51 students
and/or teachers in education evaluated a set of 28 graphs on two 
dimensions: (a) Was the program depicted on the graph an effective 
program? and (b) What about the data supported such a conclusion? 
Findings of the study indicated that visual analysis is not very 
reliable for evaluating educational programs, and is influenced
considerably by the characteristics of the data array (specifically, 
slope and variability). Training in data utilization or the use of 
aimlines did not appear to be particularly powerful procedures for 
improving visual analysis. At the same time, the findings indicated 
evaluation consistent with established data analysis paradigms. 
Implications for training in visual analysis are discussed.

Visual Analysis of Time Series Data:
Factors of Influence and Level of Reliability 

While behavioral interventions have been well documented and
empirically supported over the past several decades, the appropriate
analysis of behavioral data has not been explicated with the same
results. Most behavioral research is based upon, indeed predicated
upon, the use of time-series data, which typically are graphed on
either equal-interval or semi-log graphs. Until recently, there have
been very few methods available for analyzing such data.

The historical roots of the experimental analysis of behavior
(Sidman, 1960; Skinner, 1953) have held theoretical and practical sway
against the use of statistics in data analysis. The visual analysis
of graphed data has been the most accepted basis for judgments of the
adequacy and meaningfulness of interventions:

determination of change is dependent on the change being of
sufficient magnitude to be apparent to the eye. Compared
with the potential algebraic sophistication of statistical
tests of significance (not always realized in practice),
the above procedure usually is relatively insensitive.
(Parsonson & Baer, 1978, p. 111)

That is, it is contended that the reliance upon visual analysis, an
admittedly less sensitive measurement technique, results in an
inherent bias against the selection of weak and unstable variables
(Baer, Wolf, & Risley, 1968). Minor effects are not seen as "change";
there is a very low probability of Type I errors and consequently
a high probability of Type II errors. Type I errors result when a
conclusion of "an effect" is made when in actuality no effect is
present. Type II errors represent an error in the opposite direction:
a conclusion of "no effect" is made when, in reality, an effect is
present. Pechacek (1978) investigated the validity of N=1 designs
(both reversal and multiple baseline designs) by means of a
probabilistic model using visual analysis of effects as his criterion.
Given three possible outcomes (increase, decrease, or no change in
behavior), both the basic ABAB design and multiple baseline designs
using four baselines were found to possess a probability estimate of
Type I error well below the traditional .05 level. Although
statistical analysis would often simply corroborate such findings, it
is also true that effects would have been found for many less powerful
and stable variables, serving "only to confound, complicate, and delay
the development of a functional analysis of behavior" (Parsonson &
Baer, 1978, p. 113). It is quite likely that if statistical analysis
is needed to demonstrate certain effects, there will be problems in
replicating those effects later (Kazdin, 1976).

Furthermore, as Michael (1974) notes, an emphasis on statistics
and the elaboration of statistical control of unwanted sources of
variation in the dependent variable will likely result in a reduction
in the necessity for developing experimental control. The harmful
consequences engendered in devoting more time and effort to the use of
statistics include the loss of a source of ideas for further
experimentation, reliance upon less useful knowledge having limited
applicability, the design of experiments having less generality and
"replicability," excessive dependence upon statistical tests of
significance, and experiments being designed in more complex and less
flexible manners. In the final analysis, he believes that time spent
in learning how to use and interpret statistical procedures will
simply take time away from the primary subject of interest, a
functional analysis of behavior.

An unfortunate side effect of this controversy is that far more
effort has been invested in the development of statistical procedures
than in explicating important variables in visual analysis. This is
an important line of research in which more attention needs to be
given to the technology of graphing and the development of major
guidelines for use in "seeing change." Although visual analysis has
been the most frequently used procedure for data analysis in applied
behavioral research, there is little empirical evidence regarding its
technical adequacy. Most of the studies that have been conducted have
compared visual analysis with statistical analysis. This research
indicates that inconsistent conclusions have occurred in judging N=1
data through visual analysis when compared to statistical inference
criteria (DeProspero & Cohen, 1979; Glass, Willson, & Gottman, 1975;
Jones, Vaught, & Weinrott, 1977; Jones, Weinrott, & Vaught, 1975).

These investigations clearly demonstrate that visual analysis of 
time series data may be suspect when compared to statistical analysis. 
However, it also is important to know what aspects threaten the 
statistical conclusion validity (Cook & Campbell, 1976) of this type 
of analysis. A better understanding of the components of visual 
analysis would provide a basis for improving its accuracy and 
reliability. Three studies have been conducted with this purpose in
mind.

DeProspero and Cohen (1979) investigated the degree to which
agreement in visual judgment could be attributed reliably to certain
features of the graph. Using a set of simulated "ABAB reversal
design" graphs, they systematically varied the pattern and degree of
mean shift across phases, variability within phases, and trend. Their
results indicated that the pattern of mean shift was a critical
characteristic, with the average rating of effectiveness falling off
very rapidly for any pattern other than the "ideal" one of change
congruent with the hypothesized effect. They also found the degree of
mean shift to have a reliable effect upon the average rating. The
interrater agreement of the judges in this study was .61 overall; no
data were reported on reliability within each of the graphic
characteristics investigated. The evaluative criteria employed by the
judges fell into four cluster statements. Most frequently mentioned
was the topography of the scores - their trend, means, and stability.
The format of presentation was mentioned next most frequently,
followed by intra- and extra-experimental concerns. Although
DeProspero and Cohen (1979) "attempted to assess the factors
contributing to reliable or unreliable visual judgment," they
concluded that "graphic characteristics appear to determine judgments
in concert rather than singly" (p. 578).

An investigation to ascertain the extent to which serial
dependency influenced the agreement between inferences based on visual
or time-series analysis was conducted by Jones, Weinrott, and Vaught
(1978). JABA graphs were presented to judges well versed in behavior
charting and they were asked whether a meaningful change in level had
been demonstrated from one phase to another. The authors selected
graphs in which the effects were sufficiently "nonobvious" to warrant
critical analysis, and serial dependency was apparent. The graphs
were blocked further into three different levels of serial dependency
by two levels of significance of difference in level between phases.
Their results indicated that agreement between visual analysis and
time-series analysis was inversely related to the magnitude of the
serial dependency in the scores. That is, the more serial dependency
present (with a significant difference in level), the less reliable
visual analysis tended to be. Furthermore, they found that visual and
time-series inferences agreed better when the statistical test showed
non-significant changes in level than when significant changes in
level were indicated. Finally, an interaction effect was present in
which visual and time-series inferences agreed most when the data
showed neither serial dependency nor significant differences in level.
In effect, judges tended to agree with time-series analysis that no
effect was present but disagreed most when an effect was present.
Intercorrelations among the 11 judges ranged from .04 to .79, with a
median of .39, suggesting fairly low consensus among judges and
indicting the dependability of visual inferences. However, there was
no relationship found between the reliability of the judges and the
degree of agreement with time-series inferences.
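
A consensus figure of the kind reported above (a range and a median of judge-by-judge intercorrelations) can be illustrated with a minimal sketch. It assumes the coefficients are Pearson correlations computed over each pair of judges' ratings of the same graphs; the text does not state which coefficient Jones et al. used, and the ratings below are hypothetical, not study data. With 11 judges the same procedure would yield 55 pairwise coefficients.

```python
from itertools import combinations
from statistics import mean, median


def pearson(x, y):
    """Pearson correlation between two equal-length rating lists."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def pairwise_agreement(ratings):
    """Return (lowest, median, highest) of all judge-by-judge correlations.

    `ratings` maps a judge label to that judge's ratings of the same graphs.
    """
    rs = [pearson(ratings[a], ratings[b]) for a, b in combinations(ratings, 2)]
    return min(rs), median(rs), max(rs)


# Hypothetical ratings from three judges on five graphs (not study data).
example = {
    "judge_1": [1, 2, 3, 4, 4],
    "judge_2": [1, 2, 3, 4, 4],
    "judge_3": [4, 4, 3, 2, 1],
}
low, mid, high = pairwise_agreement(example)
```

In this toy case two judges agree perfectly and the third rates in the opposite direction, so the median coefficient is strongly negative even though the highest is 1.0; a wide range with a modest median is the pattern the authors read as low consensus.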

Jones et al. (1978) consider their findings (low agreement with 
high serial dependency and statistically reliable changes in level, 
and high agreement with low serial dependency and unreliable changes 
in level) to be contrary to the unlikely and/or undesirable purpose of 
research using an operant paradigm. They conclude that "statistically
reliable experimental effects may be more often overlooked by visual
appraisals of data than nonmeaningful effects" (p. 280). Their
suggestion to use time-series analysis to supplement visual analysis
(Jones et al., 1977) would result in an increase in the number of
meaningful changes inferred.

The final study (Wampold & Furlong, 1981) of visual inference
focused on an explication based on schema theory. It was hypothesized
that the process of visually analyzing time series data was primarily
a classification problem controlled by previous training in visual
inference through the use of model data - prototypes and exemplars. 
Furthermore, this training typically has been characterized by the 
presentation of prototypes and exemplars demonstrating large changes 
and little variability (small distance exemplars) in single subject 
designs. 

The primary purpose of the study by Wampold and Furlong was to 
compare graph analyses of subjects trained in different analytic 
procedures. Specifically, it was hypothesized that subjects trained 
in behavior analysis (with a focus on prototypes and small distance 
exemplars) would analyze graphed data differently than subjects 
trained in advanced statistical procedures (having little or no 
contact with the prototype or exemplars typically found in the
behavioral literature). Additionally, the ability to discriminate
between different intervention effects was investigated by analyzing
differential reactions to graphs that demonstrated either a change in 
level, a change in trend, or a change in both level and trend. 

The stimulus materials to which all subjects responded included a
series of three graphs, two of which were kept functionally equivalent
(had the same size of intervention effect in relation to the
variation), and the third depicting a smaller intervention effect
(relative to the variability). In addition, each of these three types
of graphs displayed a change from phase 1 to phase 2 in: (a) level,
(b) trend, or (c) level and trend.

The results from this research provided support "for the
hypothesis that subjects trained primarily in visual inference would
be more prone to attend to large differences while ignoring variation
in graphic data than would subjects primarily trained in statistics"
(p. 89). Additionally, it was determined that the subjects trained
in visual inference were less able to differentiate the intervention
effects than were the subjects trained in classical statistical
procedures. It must be noted, however, that neither group performed
the sorting task exceptionally well, with only 36% of the N=1 subjects
and 50% of the statistically trained group responding appropriately to
the experimental stimuli.

In summary, it appears that visual analysis of time series data
is a tenuous proposition at best and is influenced negatively by such
characteristics of the data as: (a) nonconformity to an ideal and
hypothesized pattern; (b) serial dependency; and (c) variability
relative to certain changes in slope and trend. However, the studies
have differed in major ways, including the stimulus materials used in
the research and the population of subjects examined. The purpose of
this research was to examine another variation in methodology and to
focus on additional characteristics of time series data that influence
visual analysis. The main reason for this pursuit is related to the
shortcomings of previous research. The DeProspero and Cohen (1979)
study provided few meaningful or specific findings having implications
for improving the visual analysis of time series data. The Jones et
al. (1978) study manipulated a statistical variable not readily
amenable to manipulation in the field, although the findings are quite
relevant. Finally, Wampold and Furlong (1981) looked at changes
between different time series rather than within various time series,
limiting any interpretations that can be made.

As important as these methodological considerations are, however,
the populations of subjects used by these researchers provide another
critical reason for conducting further research. The subjects in this
previous research were graduate students and/or professionals with
considerable experience in data analysis. Because of this, this
previous research simply has been inadequate for answering the
question of the effects of training on school teachers. The focus of
the current research was on the interpretation of time series data for
purposes of making educational decisions involved in program
evaluation. Therefore, to provide external validity to this
investigation, it was imperative that the population of subjects
sampled was appropriate to the population of public school educators.

Method

Subjects 

Subjects for this study were in-service and pre-service teachers
from three different locations around a large midwestern city. Two of
the sites were school districts, accounting for nine of the subjects,
all of whom were currently teaching. Teachers in these two sites were
randomly assigned to different treatment conditions, with the three
subjects from one district assigned to the experimental group and the
six from the other district assigned to the control group.

The remaining 42 subjects were students taking a required special
education class at a large midwestern university. Most subjects were
currently teaching or were former teachers. Subjects from this pool
were randomly assigned to treatment groups in proportion to the number
needed for bringing both groups to the same size. Twenty students
were assigned to the control group and 22 assigned to the experimental
group.

Training Procedures 

The training of subjects involved both an in-service workshop and
a "take-home" training module. The teachers in the experimental group
were given training in the analysis of graphed data for evaluating
instructional programs. This entailed explanations and exercises in
summarizing student performance and using it to make interpretations.
Included in the summarization of time-series data were computations of
step changes, medians, slopes (using the split-middle technique;
White, 1971), variability (using total bounce; Pennypacker, Koenig, &
Lindsley, 1972), and overlap (Parsonson & Baer, 1978). A portion of
the workshop also was devoted to the use of this information for
evaluating instruction.
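
Three of the summaries named above can be sketched in a few lines. This is an illustration under stated assumptions, not the workshop's actual materials: the slope function implements only the quarter-intersect step commonly described as the first stage of the split-middle technique (a line through the median point of each half of the series) and omits the later adjustment that balances points above and below the line; the step change is taken as the first intervention point minus the last baseline point. Total bounce and overlap are left out because the text does not give their computational definitions. All function names and scores are hypothetical.

```python
from statistics import median


def quarter_intersect_line(values):
    """Slope and intercept of the line through the two half-medians.

    The series is split into an earlier and a later half (the middle point
    is left out when the length is odd). Each half contributes one point:
    its median session number and its median value.
    """
    n = len(values)
    half = n // 2
    first_x, first_y = list(range(half)), values[:half]
    second_x, second_y = list(range(n - half, n)), values[n - half:]
    x1, y1 = median(first_x), median(first_y)
    x2, y2 = median(second_x), median(second_y)
    slope = (y2 - y1) / (x2 - x1)
    intercept = y1 - slope * x1
    return slope, intercept


def step_change(baseline, intervention):
    """First intervention score minus last baseline score."""
    return intervention[0] - baseline[-1]


# Hypothetical scores (not study data): flat baseline, rising intervention.
baseline = [10, 12, 11, 10, 12, 11, 10, 12, 11, 10, 11]
intervention = [13 + 2 * day for day in range(15)]

baseline_median = median(baseline)
slope, intercept = quarter_intersect_line(intervention)
jump = step_change(baseline, intervention)
```

Using medians rather than means keeps a single extreme day from tilting the trend line, which is presumably why the technique suits short classroom series.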

The teachers in the control group were given training in the
development of measurement techniques in the areas of reading,
writing, and spelling. They were trained in assessing students to
determine performance discrepancies, sampling curriculum materials to
find an appropriate instructional level, and developing a measurement
system to monitor student improvement.

Both workshops lasted approximately 2½ hours. Following the
workshop, the experimental materials (graphs, response sheets, and
directions) for 14 graphs were distributed. Following completion of
these graphs (which ranged from one week for the subjects in the class
to three weeks for subjects in the schools), a second set of 14 graphs
was distributed. The completion and return of this material again
took one week for the subjects in the class and three weeks for those 
in the schools. 

Materials

A total of 28 different graphs was constructed in which slope and
variability were systematically manipulated. Two phases were
displayed in each graph: 11 data points in baseline and 15 data
points in the intervention phase. A vertical line was drawn
separating the two phases. The aimline represented a 30% improvement
over the median of the last three days during baseline. To ensure
comparability between the graphs with and without aimlines, the
absolute level of this median value was nearly the same across both
aimline conditions within each respective level of slope. Although
the slope was manipulated only in the intervention phase, variability
was manipulated in both baseline and during the intervention. A total
of three levels of slope and four conditions of variability were
included in the graphs. (Details of the procedures for constructing
the graphs are presented in Appendix A.)
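
The aimline rule stated above (a 30% improvement over the median of the last three baseline days) reduces to a one-line calculation. The sketch below computes only the target level; how the line was drawn across the intervention phase is described in Appendix A and is not assumed here. The function name, parameter names, and scores are hypothetical.

```python
from statistics import median


def aim_value(baseline, improvement=0.30, window=3):
    """Target level: the median of the last `window` baseline scores,
    raised by the stated proportion of improvement."""
    reference = median(baseline[-window:])
    return reference * (1 + improvement)


# Hypothetical 11-day baseline (not study data); the last three days are 11, 10, 11.
baseline = [10, 12, 11, 10, 12, 11, 10, 12, 11, 10, 11]
target = aim_value(baseline)
```

Because the target depends only on the last three baseline days, holding that median nearly constant across graphs, as the authors did, keeps the aimline comparable from graph to graph.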

With variability manipulated in both baseline and intervention,
two different combinations of variability were included: a bounce of
5 data points and one of 15 data points. For every combination of
slope, variability increased (5-15), decreased (15-5), remained at the
same low level (5-5) or remained at the same high level (15-15). This
resulted in the following combinations of graphed data:

(a) Six graphs showed an increase in variability from
baseline to intervention from 5 data points bounce to
15 data points bounce, with a concurrent increase in
slope from 0 to 10 degrees for two graphs, an increase
from 0 to 15 degrees in two graphs, and an increase from
0 to 20 degrees for the final two graphs. Of these six
graphs, three had an aimline drawn in during the
intervention phase, one for each combination of slope
and variability.

(b) Six graphs showed a decrease in variability from 15
data points bounce in baseline to 5 data points bounce
in the intervention phase. For two of these graphs,
the change in slope from baseline to intervention
involved an increase from 0 to 10 degrees, two graphs
depicted an increase from 0 to 15 degrees, and two had
an increase from 0 to 20 degrees. Again, an aimline
was drawn in on half (three) of the above graphs, one
from each combination.

(c) Six graphs showed steady (unchanging) variability
at a low level (5 data points bounce) from baseline
to intervention. Again, the slope changed from 0
to 10 degrees on two of the graphs, 0 to 15 degrees
on two graphs, and 0 to 20 degrees on the final two
graphs. For each pair of slope-variability, one had
an aimline and one did not.

(d) Six graphs showed steady (unchanging) variability
at a high level (15 data points bounce) from baseline
to intervention. Each level of slope (10, 15, and 20
degrees) was represented; two graphs displayed a change
of 0 to 10 degrees, two graphs displayed a change of
0 to 15 degrees, and two showed a change of 0 to 20
degrees. Again, aimlines were present on half of
these - one in each combination of slope-variability.

The final four graphs had the following characteristics: 

(e) Four graphs which were given at time 1 were again
given at time 2, with exactly the same data array
depicted. All of these graphs displayed a low slope
change (0 to 10 degrees) and constant variability
(either the same low or same high variability). For
each of the two variability conditions, one had an
aimline present and one had no aimline present.
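
The composition of the graph set described in (a) through (e) can be checked by enumerating the design. The sketch below is only a bookkeeping aid built from the counts given in the text; the dictionary keys and condition labels are illustrative names, not the authors' terminology.

```python
from itertools import product

slopes = [10, 15, 20]  # intervention slope in degrees (baseline slope is 0)
variability = {        # (baseline bounce, intervention bounce) in data points
    "increased": (5, 15),
    "decreased": (15, 5),
    "low constant": (5, 5),
    "high constant": (15, 15),
}
aimline = [True, False]

# Items (a)-(d): every slope x variability x aimline combination once.
unique_graphs = [
    {"slope": s, "variability": v, "bounce": variability[v], "aimline": a}
    for s, v, a in product(slopes, variability, aimline)
]

# Item (e): low-slope, constant-variability graphs shown again at time 2.
repeated_graphs = [
    g for g in unique_graphs
    if g["slope"] == 10 and g["variability"] in ("low constant", "high constant")
]

total = len(unique_graphs) + len(repeated_graphs)
```

The enumeration gives 24 distinct combinations plus 4 repeated graphs, matching the 28 graphs each subject rated across the two sets of 14.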

Dependent Variables 

As noted previously, each subject was given 14 of the graphs
immediately following training. Each graph had a response sheet which
included two primary questions (see Appendix B):

(1) Was the intervention depicted on the graph an
effective one? Response to this question consisted of
rating the effectiveness on a 1-4 scale, with 1 being
definitely not effective and 4 being definitely
effective.

(2) What about the data led them to the above conclusion?
Response to this question was a short-answer
description of anything in the data array that they were
particularly attentive to while making their judgment.

After the first set of 14 graphs and responses was collected, another
set of 14 graphs was distributed. The order in which the graphs were
organized (and completed) was determined randomly for both groups of
subjects.

Results 

What Influence Do Slope and Variability Have on Ratings of
Intervention Effectiveness?

The average ratings of intervention effectiveness are summarized
in Table 1. A significant difference was found between the three
levels of slope, F(2,98) = 116.4, p < .000, and the four conditions of
variability, F(3,147) = 14.2, p < .000, as well as the interaction
between slope and variability, F(6,294) = 22.8, p < .000. The average
ratings for the three levels of slope increased monotonically for 10,
15, and 20 degrees, respectively. For the four conditions of
variability, the average ratings were higher for decreased variability
(2.81) and high constant variability (2.74), and lower for increased
variability (2.47) and low constant variability (2.45).
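
The degrees of freedom reported for these tests are consistent with a mixed design in which slope and variability are within-subject factors and the 51 subjects fall into two between-subject groups. That reading is an assumption; the text does not name the ANOVA model. Under it, each within-subject effect has (levels - 1) numerator degrees of freedom, multiplied across factors for an interaction, and its error term has that number times (subjects - groups). The function name below is hypothetical.

```python
def within_effect_df(n_subjects, n_groups, *within_levels):
    """Numerator and error df for a within-subject effect (or an interaction
    of within-subject factors) in a mixed design, assuming each effect is
    tested against its own effect-by-subjects-within-groups error term."""
    effect_df = 1
    for levels in within_levels:
        effect_df *= levels - 1
    error_df = effect_df * (n_subjects - n_groups)
    return effect_df, error_df


slope_df = within_effect_df(51, 2, 3)           # three slope levels
variability_df = within_effect_df(51, 2, 4)     # four variability conditions
interaction_df = within_effect_df(51, 2, 3, 4)  # slope x variability
```

The three results reproduce the reported F(2,98), F(3,147), and F(6,294).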



Insert Table 1 about here 



The interaction between slope and variability is depicted in
Figure 1. When variability was constant, there was a linear increase 
in the ratings of intervention effectiveness across slope levels. 
When variability changed (either increased or decreased), similar 
ratings were given for both of the lower levels of slope (10 and 15 
degrees), regardless of the direction of the change. However, with a 
20 degree slope, there was a substantial increase in the ratings of
effectiveness when variability decreased, with little change in the 
rating when variability increased. 



Insert Figure 1 about here 



Data on the reliability of ratings for the three levels of slope
and four conditions of variability are summarized in Table 2. The
relationship between slope and reliability appeared to be mediated by
the influence of variability. Of the three levels of slope, the
lowest reliability occurred with the intermediate slope level (15
degrees). While the greatest reliability occurred with the steepest
slope (20 degrees), there was one exception. Under conditions of
increased variability, the highest reliability occurred with the
lowest slope (10 degrees). When variability decreased, the
reliability was highest when the slope was steep (20 degrees). In
these two conditions of variability, reliability deteriorated
considerably with low increases in slope (from 10 to 15 degrees).



Insert Table 2 about here




The overall influence of variability on the average reliability
of ratings was most pronounced when variability increased. Under that
condition, the average reliability was the lowest. The difference
between the other conditions of variability, however, was considerably
less. The effect of variability on reliability also appeared to be
mediated by the level of slope. With a low slope of 10 degrees, there
was little change in reliability across the various conditions of
variability. When the slope was higher (15 and 20 degrees),
reliability changed with changes in variability. For a 15 degree
slope, reliability was highest when variability was constant (either
low or high). In contrast, with a slope of 20 degrees, the
reliability was highest when variability decreased or remained low and
constant.

There appeared to be little differential effect on the stability 
(reliability) of ratings from time 1 to time 2 under conditions of 
constant variability (see Table 3). Very similar findings appeared 
whether or not the variability had been low. 



Insert Table 3 about here



What Influence Does the Use of Aimlines and Training in Data
Utilization Have on Ratings of Intervention Effectiveness?

The results of the rating of intervention effectiveness are
summarized in Table 4. Although a significant effect was found for
training, F(1,49) = 14.0, p < .000, there was no effect found for the
use of aimlines, F(1,49) = 0.36, p < .552, or the interaction between
the use of aimlines and training in data utilization, F(1,49) = 2.7, p
< .105. The average rating by trained subjects was less than the
rating by untrained subjects. In contrast to this significant
difference, nearly the same ratings were given when aimlines were
present as when they were absent.




What Influence Does the Use of Aimlines and Training in Data
Utilization Have on the Reliability of Ratings of Intervention
Effectiveness?

There was little difference in the average reliability
(consensus) across training and aimline conditions (see Table 5), with
the range from .51 to .54. Trained subjects were slightly more reliable
when aimlines were present (.54 vs. .51), while untrained subjects
showed no difference in reliability across this dimension (.52). The
difference in reliability between trained and untrained subjects was
very slight (.01 to .03).



Insert Table 5 about here 



A greater difference was apparent in the reliability of ratings
over time for the training and aimline conditions (see Table 6).
Trained subjects were considerably more reliable from time 1 to time 2
than untrained subjects, regardless of the presence (or lack) of
aimlines. While trained subjects were more reliable when aimlines
were present than when they were absent, untrained subjects were
actually more reliable without aimlines from time 1 to time 2.



Insert Table 6 about here



What Type of Data Dimensions are Utilized by Trained and Untrained 
Subjects in their Ratings of Intervention Effectiveness? 

For each graph, subjects were asked to describe any
characteristic of the data array that influenced their judgments.
Their responses were categorized into nine dimensions of time series
data that summarize and describe change over time. These categories
were structured around various statistical summarizations, each one
providing unique information for evaluating change in performance. In
most cases, there were many different descriptions of any particular
characteristic of the data, though reference was obviously to the same
dimension. Following is a list of the categories and a brief
explanation/definition using the various terms listed by the subjects:

(a) Progress - nonspecific statements of changes in performance
over time. Synonymous terms included slope, upward
(downward) movement, rate increases (decreases),
acceleration, improvement, gains.

(b) Variability - descriptions of day-to-day variation in
performance. Synonymous terms included scatter, fluctuation,
range, (in)consistency, (un)stable, steady, gradual,
sporadic, (un)predictable.

(c) Jump - immediate change in performance from the last day of 
baseline to the first day of the intervention phase. Other 
terms included changes in step, or level, and immediate 
increase (decrease) in performance. 

(d) Direction - comparison of slope from baseline to intervention
or within the intervention phase from the beginning to the
end. Also included in this category were statements
describing a leveling off or a previously flat (downward)
slope as now increasing.

(e) Number of Days of increases and decreases relative to any
index: previous days, baseline, slope, aimline, overlap.
Statements that implied counting also were included,
allowing for descriptions of performance as being
"consistently," "never," "always," "the majority of time"
over (under) the above indices.

(f) Goal/Aim - use of goals or aimlines to qualify
interpretations of performance, including any comparison of
actual to expected performance.

(g) Average Performance - use of a composite summarizing index
for measuring change between baseline and intervention or
within the intervention phase, from beginning to end,
including mean, average, median, or percent.

(h) Overlap - reference to the band within which scores fall
across phases. Any statements taking note of simultaneous
comparison of high and low points between phases were
included in this category.

(i) Absolute Values - use of numbers from the graph representing
single-point values, including high or low scores and/or
the difference between them, or the last day of baseline, the
last day of intervention and/or the difference between them.

Table 7 contains the means and standard deviations of the number
of references made to each of the characteristics. There was no
difference between trained and untrained subjects on only two
dimensions: progress and the number of days improved. For the
remaining dimensions there were significant differences between the
two groups. Trained subjects referred more often to every dimension
except absolute values. Untrained subjects referred to this
characteristic significantly more often than trained subjects. In
addition, the range of frequencies across the various dimensions was
quite great. Reference was made most often to progress and
variability for both groups. The only dimension not used very
frequently by trained subjects was absolute values. In contrast,
untrained subjects rarely referred to jump, direction, and overlap.

Insert Table 7 about here



Another analysis of this same variable - frequency of reference
to data characteristics - was conducted on the number of different
dimensions mentioned for each graph. The results indicated a
significant main effect for changes in slope, F(2,98) = 11.2, p <
.000. The difference between the three levels of slope revealed an
interesting relationship (see Table 8). More dimensions were referred
to when the slope was 15 degrees. In contrast, when the slope was 10
or 20 degrees, this number dropped. No significant effects were found
for variability, F(3,147) = .49, p < .690, or the interaction between
slope and variability, F(6,294) = 1.1, p < .348.



Insert Table 8 about here


Table 9 is a summary of the frequency of reference to data
dimensions as a function of aimline condition and training condition.
All three sources of variance were found to be significant: both main
effects, aimlines, F(1,49) = 49.3, p < .000, and training, F(1,49) =
39.5, p < .000, as well as the interaction between them, F(1,49) =
82.0, p < .000. Trained subjects used more dimensions than untrained
subjects. Fewer dimensions were referenced when aimlines were present
than when no aimlines were present. The interaction between training
in data utilization and the use of aimlines appears in Figure 2.
While there was no difference between trained and untrained subjects
when aimlines were present, there was a great difference when no
aimlines were present. In this latter condition, trained subjects
referred to a far greater number of dimensions than the untrained
subjects.



Insert Table 9 and Figure 2 about here 



Discussion 

In general, the findings from this research are consistent and
logical within the framework of data utilization. For instance,
successively higher levels of slope were rated higher in intervention
effectiveness and, for the four conditions of variability, the lowest
ratings were given when variability either increased or was low and
constant, while the highest ratings were given when variability either
decreased or was high and constant. Both of these interpretations
would be consistent with established data utilization practice:
steeper slopes mean higher (faster) rates of improvement, and increased
variability signifies lack (loss) of control of those variables
relevant to performance (Parsonson & Baer, 1978). Yet the ratings of
interventions followed by high constant variability were higher than
those followed by low constant variability. The interpretation
apparently is one of considering erratic performance as at least
including some high scores, which was viewed as a more positive aspect
than consistent control of performance.

While the above finding was true in general, the presence of an
interaction between slope and variability necessitates a qualification
of that result. Relative to the other three conditions, graphs with
increased variability received the highest rating of intervention
effectiveness when the slope was 10 degrees, the lowest rating by
only a small margin when the slope was 15 degrees, and the lowest
rating by a significant margin when the slope was 20 degrees.
That is, when there is minimal improvement over time (a low slope),
increased variability is not viewed as a negative component of
performance. As the rate of improvement increases, increases in
variability result in lower ratings of effectiveness, relative to the
other conditions. At the same time, if variability does not change,
but remains high, ratings of effectiveness also remain high, nearly
the same as if variability had decreased. Again, some degree of
variability actually is found acceptable and there is attention to
large changes.

This finding is somewhat in keeping with that reported by Wampold
and Furlong (1981). In that study, subjects failed to appreciate the
functional equivalence of two different time series in which the
change in intervention effects was the same relative to the variation
present. That is, a steeper slope or trend (with proportionately
greater variability) should be rated the same as a modest slope or
trend in which the variability is proportionately smaller. In this
study, subjects rated graphs with low slope and variability as
reflections of no intervention effect and graphs with a high slope and
high variability as reflecting a very strong intervention effect.
However, because variability was manipulated in both phases in this
study, it was possible to ascertain subjects' responses to this factor
within a time series (between phases), as well as between different
time series, a condition lacking in the Wampold and Furlong (1981)
study. In analyzing this factor, it is apparent that subjects reacted
differentially to various changes in variability between phases.
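
One way to make the notion of change "relative to the variation
present" concrete is to divide the trend of a series by the spread of
the data around that trend. The sketch below is illustrative only: the
least-squares slope, the residual standard deviation, the function
names, and the scores are assumptions introduced here, not the index
used by Wampold and Furlong (1981) or in this study. It shows that
doubling both the slope and the variability leaves the ratio unchanged.

```python
from statistics import mean, pstdev


def least_squares_slope(values):
    """Slope of the least-squares line through (day, value) pairs, days 0..n-1."""
    days = list(range(len(values)))
    day_mean = mean(days)
    value_mean = mean(values)
    numerator = sum((d - day_mean) * (v - value_mean) for d, v in zip(days, values))
    denominator = sum((d - day_mean) ** 2 for d in days)
    return numerator / denominator


def trend_to_variability_ratio(values):
    """Hypothetical index: slope divided by the spread of points around the trend line."""
    slope = least_squares_slope(values)
    intercept = mean(values) - slope * mean(range(len(values)))
    residuals = [v - (intercept + slope * d) for d, v in enumerate(values)]
    return slope / pstdev(residuals)


# Made-up scores: the second series has twice the slope and twice the variability.
modest_series = [10, 12, 11, 13, 12, 14]
steep_series = [2 * v for v in modest_series]
```

Under this index the two series are functionally equivalent, even
though the second looks far steeper on a graph.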

The findings for the two evaluation variables, the use of
aimlines and training in data utilization, revealed less of an effect
and less consistency in the effects. The use of aimlines did not
appear to have any significant effect on the ratings of intervention
effectiveness. Subjects rated intervention effectiveness the same
regardless of the presence (or lack) of aimlines. The apparent effect
of training was to create a more cautious perspective in evaluating
programs, with untrained subjects rating intervention effectiveness
significantly higher than trained subjects. This may be, in part, a
function of the number of data dimensions that trained subjects
attended to during their evaluations. It is possible that trained
subjects were attending to different elements of the data array in
concert and not simply responding to any one element.

This characteristic of time-series data - the capacity of
generating several summary statistics - is both an advantage and a
disadvantage. There is flexibility in summarizing performance in many
different ways, allowing change to be reflected in a sensitive and
appropriate manner. At the same time, the use of such data becomes
more problematic, because not all of the indices are changing in
concord with each other. That is, when the data array depicts both an
increase in slope and variability, judgments of effects may be
tempered. Because the trained subjects had at their disposal a more
complete and detailed procedure for evaluating effects, it is possible
that the net result was one of moderating conclusions of
effectiveness.

The lack of a significant interaction between the use of aimlines
and training in data utilization represents an interesting finding.
It is possible that the training session was not effective and/or the
skills developed as a result of training were not sufficient to
differentiate that group of subjects from the untrained group. That
is, trained subjects evaluated graphs with aimlines the same as they
evaluated graphs with no aimlines, failing to apply the decision rule
criteria of three days above or below the aimline. On the other hand,
it is possible that the critical factor in the training-aimline
interaction is the aimline, not the training. A group of untrained
subjects may be evaluating program effectiveness in the same
differential manner on graphs with and without aimlines as subjects
trained in the use of aimline decision rules.
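
The decision rule referred to above (three days above or below the
aimline) can be sketched as a simple check on the most recent data
points. This is a minimal illustration under stated assumptions: the
aimline is taken to be a straight line from a starting score to a goal
score, the three days are taken to be the three most recent consecutive
days, and the function names and return labels are hypothetical rather
than drawn from the training materials used in this study.

```python
def aimline_value(start_value, goal_value, n_days, day):
    """Height of a straight aimline on a given day (day 0 .. n_days - 1)."""
    return start_value + (goal_value - start_value) * day / (n_days - 1)


def apply_decision_rule(scores, aimline, run_length=3):
    """Classify the most recent run of scores relative to the aimline.

    Returns "below" if the last run_length scores all fall under the aimline,
    "above" if they all fall over it, and "none" otherwise.
    """
    if len(scores) < run_length:
        return "none"
    recent = list(zip(scores, aimline))[-run_length:]
    if all(score < aim for score, aim in recent):
        return "below"
    if all(score > aim for score, aim in recent):
        return "above"
    return "none"


# Made-up reading rates over six days, with an aimline rising from 20 to 30.
aimline = [aimline_value(20, 30, 6, day) for day in range(6)]
lagging_scores = [20, 22, 21, 23, 22, 22]
leading_scores = [20, 23, 26, 29, 31, 33]
```

Here `apply_decision_rule(lagging_scores, aimline)` returns "below" and
`apply_decision_rule(leading_scores, aimline)` returns "above"; what an
evaluator does with either outcome is left outside the sketch.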

In the former case, the implication is that training should be
more extensive than that implemented in this research. Although the
procedures of analysis and the paradigm of evaluation were fully
described and modeled, the subjects were given very little practice
and no feedback prior to their evaluation of the graphs. In the
latter argument, the implication is that training in data utilization
is unnecessary, as long as the graphs being evaluated contain
aimlines. Explanation of decision rule criteria need not be included
either. A simple depiction of performance relative to an aimline is
all that is necessary.

Further support for the lack of training hypothesis comes from an
analysis of reliability. Not only was there no differential use of
the data by trained and untrained subjects on graphs with and without
aimlines, but little effect was found for either of the two factors on
the reliability of ratings of intervention effectiveness. Untrained
subjects were nearly as reliable as trained subjects, and little
difference existed in the use of aimlines. Although trained subjects
were slightly more reliable on graphs with aimlines, untrained
subjects were not. Therefore, the use of aimlines without training
does not appear to be a critical factor.

In contrast to the lack of training effects on the use of
aimlines and reliability of ratings, there was an effect on the
stability (reliability over time) of ratings: Trained subjects were
more reliable than untrained subjects; ratings of effectiveness were
more reliable when aimlines were present; and there was an interaction
between the use of aimlines and training in data utilization, with
trained subjects more reliable on graphs with aimlines and untrained
subjects more reliable on graphs without aimlines. In general, the
range and absolute values of reliability coefficients are in keeping
with previous investigations (DeProspero & Cohen, 1978; Jones et al.,
1977). Visual analysis of time series data has modest reliability at
best.
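
The reliability index reported in the tables, Agreement/(Agreement +
Disagreement), can be illustrated with a short sketch. It assumes that
an agreement means two identical ratings of the same graph (for
stability, one subject's ratings at Time 1 and Time 2); the exact
scoring convention is not restated in this section, and the function
name and the ratings are hypothetical.

```python
def agreement_ratio(ratings_a, ratings_b):
    """Agreements / (agreements + disagreements) for two paired sets of ratings."""
    if len(ratings_a) != len(ratings_b) or not ratings_a:
        raise ValueError("ratings must be non-empty and of equal length")
    agreements = sum(1 for a, b in zip(ratings_a, ratings_b) if a == b)
    disagreements = len(ratings_a) - agreements
    return agreements / (agreements + disagreements)


# Made-up ratings of six graphs on the 1-4 effectiveness scale.
time_1 = [1, 2, 3, 4, 2, 3]
time_2 = [1, 2, 4, 4, 1, 3]
stability = agreement_ratio(time_1, time_2)  # 4 agreements out of 6 graphs
```

A coefficient near .50, as found in this study, means that only about
half of the paired ratings matched.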

Two factors that appear to influence reliability are slope and
variability. The effect of variability was most pronounced when it
increased (resulting in low reliability), with little difference
among reliabilities in the other three conditions. The effect of
slope was most noticeable when it was steep (resulting in the highest
reliability). Generally, the differences between the reliability
coefficients for the various conditions of variability increased as
the slopes increased. When the slope was 10 degrees, the range was
.51-.53; the range was .46-.64 for a slope of 20 degrees. This
finding again indicates that not all data indices are equivalent
stimulus dimensions for rating intervention effectiveness. When the
slope is low, there is little differentiation and the absolute level
of reliability is quite low (.52). When the slope is steep, the
reliability of ratings of effectiveness is very low (.46) when
variability has increased, and modest (.64) when variability is low
and constant. Nevertheless, the range is greater with a steeper
slope.

A descriptive analysis of the data dimensions utilized for
evaluating effectiveness provides a partial explanation for the low
levels of reliability and problems with training. Subjects' responses
reflected the influence of many characteristics of the data, rather
than any single dimension. The three most frequently cited
dimensions, however, were those that were manipulated in this study
- slope, variability, and aimlines. In addition, several other
characteristics appeared influential, greatly expanding the type and
frequency of interactions possible. The dimensions attended to by the
trained subjects were both more varied and cited with greater
frequency than those attended to by the untrained subjects. Thus, for
any given graph, the subject's response was under the control of eight
different characteristics (for trained subjects) or six different
characteristics (for untrained subjects), excluding those that rarely
were considered.

There was also a difference in the kind of data characteristics
used by trained versus untrained subjects. The only dimension
consistently referred to more frequently by untrained subjects was
absolute values. This particular characteristic is probably the most
static, least informative, and most potentially biasing of any of the
possible dimensions. Given a time-series data array, the use of a
single score to summarize change in performance has many problems, not
the least of which is the failure to take advantage of that
characteristic unique to time-series data - changes in scores over
time. In contrast, the remaining characteristics reflect changes over
time and consistently were referred to more frequently by trained
subjects. Furthermore, there was an indication that the data array
itself influenced the number of data characteristics mentioned. It
appears that, in general, when the changes were more obvious, there
was a reliance on fewer characteristics. For instance, the use of
aimlines provided a clear indication of relative improvement,
resulting in less reliance on other data characteristics. Or when
growth was either minimal (10 degrees) or maximal (20 degrees), fewer
dimensions were referred to in the evaluation process. Finally, it
was only when the data became unpredictable (high and constant or
increased variability) that reference to other dimensions was
increased.

In conclusion, before an adequate and valid analysis of time
series data using visual inspection can be established, some
consistent data utilization needs to occur. As this study has
demonstrated, there are several factors that influence this process,
including training in data analysis and the data array itself. The
fact that these influences all occur together simply makes the task at
hand more difficult. The simple use of aimlines did not appear to
result inherently in a better analysis. Rather, a decision-making
system needs to be empirically established that takes into account
both the fact that judgment is based on several dimensions at the same
time and that such factors often conflict with each other.
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Table 1 

Average Rating of Intervention Effectiveness for
All Levels of the Slope and Variability Factors

                          Slope
Variability     10°     15°     20°     Average

Increase        2.6     2.2     2.6       2.5
Decrease        2.6     2.3     3.4       2.8
Low             1.9     2.5     3.0       2.5
High            2.3     2.6     3.3       2.7
Average         2.3     2.4     3.1       2.6



Table 2

Comparison of Trained and Untrained Subjects on the Reliability of
Ratings (Agreement/Agreement + Disagreement) for Each Combination
of Slope and Variability

Slope      Inc. Var.   Dec. Var.   Low Var.   High Var.   Average

10°           .51         .52         .51        .53         .52
15°           .43         .48         .51        .52         .49
20°           .46         .62         .64        .55         .57
Average       .47                     .56        .54         .53
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Table 3 

Reliability of Ratings (Agreement/Agreement +
Disagreement) from Time 1 to Time 2
for Graphs with Variability Manipulated

Low      High
.52      .54



Table 4 

The Average Rating of Intervention Effectiveness
for Both Levels of the Aimline and Training Factors

               Trained    Untrained    Average

Aimline          2.4         2.8         2.6
No Aimline       2.5         2.7         2.6
Average          2.5         2.8         2.6
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Table 5 

Reliability of Ratings (Agreement/Agreement + Disagreement)
for Both Levels of the Aimline and Training Factors

         Trained                      Untrained
Aimline     No Aimline        Aimline     No Aimline

  .54          .51              .52          .52



Table 6

Reliability of Ratings (Agreement/Agreement +
Disagreement) from Time 1 to Time 2 by Trained
and Untrained Subjects for Graphs with Aimline Manipulated

Independent
Variable           Trained    Untrained    Average

Aimline              .66         .40         .53
Without Aimline      .53         .48         .51
Average              .62         .44         .52
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Table 7 



Average Number of References Made to Various Characteristics
of the Data by Trained and Untrained Subjects

                               Trained                     Untrained
Data Characteristic     Mean   S.D.   Min.   Max.   Mean   S.D.   Min.   Max.

Progress (slope)        18.5    3.2    14     26    19.7    5.5     7     28
Variability*            21.8    3.6    12     27    15.7    5.9     5     29
Jump*                    9.4    4.8     1     18      .4    1.4     0      7
Direction*               6.8    4.6     0     19     2.7    4.2     0     18
No. Days Improved        7.9    3.3     1     15     6.0    6.2     0     25
Goal/Aim*               12.0    2.1     4     15     8.7    4.4     0     15
Average Performance*    13.5    4.2     7     23     6.2    7.0     0     24
Overlap*                11.4    4.6     2     20     1.0    2.1     0      7
Absolute Values*         3.5    3.5     0     15     7.8    6.2     0     26

* Significant at p ≤ .05.

Table 8

Number of Data Dimensions Referenced
for All Levels of the Slope and Variability Factors

                              Slope
Variability         10°     15°     20°     Average

Increase            2.8     3.2     2.7       2.9
Decrease            2.8     3.1     2.9       2.9
Low, constant       2.8     3.0     2.7       2.9
High, constant      2.7     3.1     2.9       2.9
Average             2.8     3.1     2.8       2.9



Table 9

Number of Data Dimensions Referenced for Both
Levels of the Aimline and Training Factors

               Trained    Untrained    Average

Aimline          2.5         2.4         2.5
No Aimline       4.5         2.2         3.4
Average          3.5         2.3         2.9

Figure 1. Interaction between changes in slope and variability
on the average rating of intervention effectiveness.

[Figure 2 horizontal axis: Use of Aimline (Aimline, No Aimline)]

Figure 2. Interaction between training in data utilization and
the use of aimlines in the number of data characteristics
mentioned.




Appendix A

Procedures for Constructing Graphs

The first step in constructing the graphs involved drawing in the
slope line: a slope of 0 degrees was drawn in during baseline and
either 10, 15, or 20 degrees drawn in during the intervention. The
lines that defined total bounce were then drawn in. These lines were
parallel to the slope line, with one passing through the data point
farthest above the slope and one passing through the data point
farthest below the slope. Bounce around the slope line was kept
nearly equidistant above and below the line. That is, if the total
bounce involved five data points, two lines were drawn parallel to the
slope line: one that was two data points above the slope line and one
that was three data points below the slope line. If the total bounce
was 15 data points, the envelope included data points 7-8 units above
and 7-8 units below the slope line. The graph at this point had a
defined slope and variability that was used as a guideline in plotting
the actual data points.

Data points then were plotted onto the graphs using the
quarter-intersect method (Pennypacker, Koenig, & Lindsley, 1972). All
data points had to fall within the range of the total bounce. To use
this procedure for systematically varying the slope, a data point had
to be determined that intersected the median of each half on the
middle day of that half. That is, using only the first half of the
graph, the median was determined and plotted at any point (on any day)
during the first half. Then an equal number of data points above and
below this point were plotted on the remaining days. The median of
this half, when plotted on the middle day of the half, defined a point
through which the slope line would pass.

The same procedure was used for the second half of the graph.
The median level necessary for the slope line to pass through on the
middle day of the second half was determined and plotted on any day
for that half. Following this, the remaining data points were plotted
such that half of the data points fell above and half fell below this
value. When this median value was plotted on the middle day, the
slope line would pass through it. This entire process resulted in a
data pattern having a given slope and variability.
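
Read in the other direction, the same idea yields a trend line from
existing data: take the median score of each half, place it on the
middle day of that half, and join the two points. The sketch below is
a simplified reading of the quarter-intersect method as described
above, not a reproduction of the Pennypacker, Koenig, and Lindsley
(1972) procedure; dropping the middle day when the number of days is
odd, the function name, and the scores are assumptions made for
illustration.

```python
from statistics import median


def quarter_intersect_line(values):
    """Return (slope, intercept) of the line through the two half medians.

    Days are numbered 0..n-1. With an odd number of days the middle day is
    left out of both halves (an assumption of this sketch).
    """
    n = len(values)
    if n < 4:
        raise ValueError("need at least four data points")
    half = n // 2
    first_days, first_scores = range(half), values[:half]
    second_days, second_scores = range(n - half, n), values[n - half:]
    # Each half contributes one point: (middle day, median score).
    x1, y1 = median(first_days), median(first_scores)
    x2, y2 = median(second_days), median(second_scores)
    slope = (y2 - y1) / (x2 - x1)
    intercept = y1 - slope * x1
    return slope, intercept


# Made-up scores for a six-day phase.
slope, intercept = quarter_intersect_line([10, 12, 11, 14, 16, 15])
```

For these scores the half medians are 11 (on day 1) and 15 (on day 4),
so the line rises 4 units over 3 days.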

In generating a data array during baseline, the slope line was
kept horizontal (with a slope of zero). However, for the data during
the intervention, the slope was predetermined at some fixed value (10,
15, or 20 degrees). In order to provide an adequate test of the
influence of slope alone in determining judgments, the change in step
(jump or level) from baseline to intervention was kept minimal. The
difference between the last data day of baseline and the first data
day of the intervention phase was kept to a maximum of two data
points. Given this one constraint on the actual value of the data
points plotted, all others were plotted in a random manner, given the
particular levels of slope and variability.
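
The overall construction (a slope line, a bounce envelope split nearly
evenly above and below it, and data points placed inside the envelope)
can be approximated in a few lines. This is a rough sketch only: it
assumes one graph unit per day on both axes so that the rise per day is
the tangent of the slope angle, it scatters points uniformly at random
rather than balancing them around the half medians, and it does not
enforce the two-point limit on the baseline-to-intervention jump. The
function name and parameter values are hypothetical.

```python
import math
import random


def build_phase(start_level, slope_degrees, n_days, total_bounce, rng):
    """Return n_days scores scattered inside a bounce envelope around a slope line."""
    # Assumption: equal axis scaling, so rise per day = tan(slope angle).
    rise_per_day = math.tan(math.radians(slope_degrees))
    # Split the bounce as evenly as possible, e.g. 5 -> 2 above and 3 below.
    bounce_above = total_bounce // 2
    bounce_below = total_bounce - bounce_above
    scores = []
    for day in range(n_days):
        line_value = start_level + rise_per_day * day
        scores.append(line_value + rng.uniform(-bounce_below, bounce_above))
    return scores


rng = random.Random(0)  # fixed seed so the sketch is repeatable
baseline = build_phase(start_level=20, slope_degrees=0, n_days=10, total_bounce=5, rng=rng)
intervention = build_phase(start_level=20, slope_degrees=15, n_days=10, total_bounce=5, rng=rng)
```

Every generated point lies within the envelope, which mirrors the
requirement that all data points fall within the range of the total
bounce.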

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Appendix B 
Evaluation Response Form 

1. Rate whether the instructional program was an effective one for
   increasing the student's reading rate.

        1               2               3               4
    Definitely       Possibly       Moderately        Very
   Not Effective     Effective      Effective       Effective

2. What about the student's performance makes you think so? 
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